1. Load data

2. Handling missing data

2A. Calculate the percentage of missing values in each column

2B. Clean datetime format

2C. Handling missing TailNum

2D. Handling missing CRSElapsedTime

2E. Handling missing ActualElapsedTime

2F. Handling missing clean_ArrTime

2G. Handling missing AirTime

2H. Handling missing TaxiIn and TaxiOut

2I. Handling missing ArrDelay

2J. Handling missing CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay

2K. Validate if we have already treated all missing values

3. Exploratory Data Analysis

3G. Filter the dataset by last 3 months only

4. Feature engineering

4A. Extract the hour from CRSDepTime and CRSArrTime

4B. Extract DayOfWeek Category to identify if a flight is on weekday or weekend

4C. Group CRSDepTime and CRSArrTime by whether it is a peak hour or not

4D. Calculate total delay in minutes

4E. Show the relationship between numerical variables and target variables

4F. Transform categorical variables into dummy variables

5. Prediction model on flight cancellation problem

5A. Data splitting

Describe the training and test sets

5B. Data scaling on continuous variables

5D. Model building

Model 1 - Logistic Regression

Model 2 - Decision Tree

Model 3 - Random Forest

Model 4 - AdaBoost

Model 5 - Gradient Boosting

5E. Model evaluation

6. Results

Random Forest showed the highest accuracy at 74.67%, while Gradient Boosting showed the highest AUC score (0.76). For some business problems, choosing the best model by accuracy may be the most appropriate approach. For flight cancellation, however, the classes are severely imbalanced, so it is more appropriate to choose the best-performing model by AUC, which indicates how well a classifier distinguishes between classes. An AUC close to 0.5 means the classifier is effectively predicting at random, without any real ability to separate the classes. Although none of these models reached an AUC that would be considered excellent (above 0.8), it is still acceptable to deploy the model into production, because at this level the model is considered sensitive to cancellations and delays. In other words, the model tends to flag a flight as cancelled when it is not (a false positive), which is preferable to failing to spot a cancelled or delayed flight.
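The difference between accuracy and AUC under class imbalance can be illustrated with a small, self-contained sketch. The labels and scores below are made up for illustration and are not taken from the flight dataset, and the helper functions `accuracy` and `roc_auc` are hypothetical names written for this example, not the evaluation code used in the analysis.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels (assumption, for illustration only): 2 cancelled flights out of 10.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Classifier A gives every flight the same low score, so it never flags a cancellation.
scores_a = [0.1] * 10
# Classifier B ranks both cancelled flights above every other flight,
# at the cost of two false alarms at the 0.5 threshold.
scores_b = [0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.7, 0.4, 0.8, 0.9]

threshold = 0.5
pred_a = [int(s >= threshold) for s in scores_a]
pred_b = [int(s >= threshold) for s in scores_b]

acc_a, auc_a = accuracy(y_true, pred_a), roc_auc(y_true, scores_a)
acc_b, auc_b = accuracy(y_true, pred_b), roc_auc(y_true, scores_b)

print(f"A: accuracy={acc_a:.2f}, AUC={auc_a:.2f}")
print(f"B: accuracy={acc_b:.2f}, AUC={auc_b:.2f}")
```

In this toy case both classifiers have the same accuracy, but only classifier B separates the two classes, which is what the AUC captures. Classifier B also shows the trade-off described above: it raises two false alarms but misses no cancelled flight.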

In terms of CPU time, which measures how long the model took to run, Random Forest ran longer than the other algorithms (762.19 seconds), yielding a relatively high AUC (0.75) but a low accuracy rate (71.94%). Meanwhile, the worst-performing algorithm was Logistic Regression, with the lowest AUC (0.72) and the lowest accuracy rate (72.41%). Although it took the shortest time to run, its results were not as good as those of the other classifiers.
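As a rough illustration of how run time might be recorded for each model, the sketch below wraps a function call with timers from Python's standard `time` module. The `timed` helper and the `fit_placeholder` function are hypothetical stand-ins introduced for this example; the timing method actually used in the analysis is not shown in this section. CPU time and wall-clock time can differ (for example, when work runs in parallel or the process waits on I/O), so the sketch records both.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, cpu_seconds, wall_seconds)."""
    cpu_start = time.process_time()    # CPU time used by this process
    wall_start = time.perf_counter()   # elapsed wall-clock time
    result = fn(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, cpu_seconds, wall_seconds


def fit_placeholder(n):
    """Hypothetical stand-in for a model's training step (assumption)."""
    return sum(i * i for i in range(n))


result, cpu_s, wall_s = timed(fit_placeholder, 200_000)
print(f"CPU time: {cpu_s:.4f} s, wall time: {wall_s:.4f} s")
```

Recording the timings next to accuracy and AUC for every classifier makes the speed versus quality trade-off easier to compare across models.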

Overall, we can conclude that different algorithms can perform quite differently. Because model performance can be measured with various metrics, it is important to understand the problem thoroughly in order to evaluate the model correctly and draw sound conclusions from it.